Have you wondered how information spreads on Twitter, how Instagram influencers are identified, and how different actors in an online community collaborate with or confront one another? These are the sorts of questions that can be best answered using network analysis and network visualization. In network analysis of internet communities, we visualize and quantify the structure of social relationships and information flows.
In this tutorial, we will work with a pre-generated network from the OSOME Network Tool.
Here, you can see a retweet network based on #Ukraine from July 9, 2022 to August 8, 2022. In this network, an edge between a pair of users represents a retweeting relationship. That is, two users are connected to one another if one retweets or is retweeted by the other. For simplicity, the graph below only shows users who retweeted or were retweeted by others at least twice.
Guess how the size and color of a node are determined.
Where do we begin to visualize a network? It all starts with an edgelist. The table below shows a portion of the edgelist.
An edgelist shows all edges in a network along with attributes of the edges. An edge is a relationship between a pair of nodes (in this case, users). An edge can be directed: for example, A retweets B will be expressed as User A → User B, whereas B retweets A is expressed as User B → User A. In some cases, however, an edge is undirected. Think about your Facebook relationships: if user A is a friend of user B, then by default user B is also connected to user A.
In the edgelist below, the column from_label lists the Twitter users who retweeted. The column to_label shows the users who were on the receiving end of the retweets (i.e., users who were retweeted by others). The weight column is the edge weight, referring to the number of retweets that occurred between the same pair of users.
In our example, a node is a Twitter user. Below is a list of nodes, with their id, labels, and attributes (e.g., size, color).
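To make the link between an edgelist, a nodelist, and a network concrete, here is a minimal sketch that builds a tiny directed graph from two data frames. The column names from_label, to_label, and weight follow the edgelist described above; the user names, weights, and the size attribute are made up for illustration and are not taken from the #Ukraine data.

```r
library(igraph)

# Hypothetical toy edgelist: who retweeted whom, and how many times
edges <- data.frame(
  from_label = c("alice", "bob", "carol"),
  to_label   = c("bob", "alice", "alice"),
  weight     = c(2, 1, 3)
)

# Hypothetical toy nodelist: one row per user, plus a node attribute
nodes <- data.frame(
  label = c("alice", "bob", "carol"),
  size  = c(10, 5, 5)
)

# The first two columns of `edges` are read as source and target;
# the first column of `nodes` must contain the names used in `edges`.
toy <- graph_from_data_frame(edges, directed = TRUE, vertices = nodes)

# Convert the graph back into an edgelist and a nodelist
igraph::as_data_frame(toy, what = "edges")
igraph::as_data_frame(toy, what = "vertices")
```

Any extra columns in the edge data frame (here, weight) become edge attributes, and extra columns in the node data frame become node attributes.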
We will try some of the basics using two libraries, igraph and visNetwork. igraph comes with some built-in functions for visualization. visNetwork goes a step further by making the visualization prettier and interactive.
There are several common file types that contain network information, including graphml, gml, or just a regular data frame. Below, we import an external .gml file into R.
library(igraph)
nx <- read_graph("RTNetwork_#ukraine.gml",format = c("gml"))
nx
## IGRAPH f4db31b D-W- 16260 44947 --
## + attr: id (v/n), label (v/c), type (e/c), weight (e/n), tweetid (e/c)
## + edges from f4db31b:
## [1] 1->10441 1-> 1998 1-> 3816 1->12068 1-> 7606 1->13186 1-> 6565
## [8] 1-> 6447 1-> 5034 1-> 1348 1-> 4391 1-> 7491 1-> 7681 1->12611
## [15] 1-> 1891 1->12796 1-> 6688 2-> 6447 2->12740 2->13859 3-> 9477
## [22] 3-> 6447 4-> 214 4-> 9032 4-> 822 4-> 6230 4-> 4876 4->11824
## [29] 4-> 4607 4-> 527 4->10465 4-> 1413 4-> 1712 4-> 8304 4-> 6559
## [36] 5-> 6466 5-> 268 5->14636 5-> 1481 6-> 1628 6->10173 7-> 3816
## [43] 7->11283 7-> 6447 7->12796 7-> 9757 7->11786 7->11647 7->10711
## [50] 7-> 252 9->14806 9-> 5139 10-> 5179 10-> 6076 10-> 3313 11-> 9904
## + ... omitted several edges
Let’s just take a look at some network-level indicators.
Run the code below to get a count of nodes and edges in nx.
vcount(nx) #this shows the number of nodes/vertices
## [1] 16260
ecount(nx) #this shows the number of edges
## [1] 44947
A densely connected network (high density score) is a type of network in which many users are interconnected, whereas a sparse network (low density) is a network in which only a few are interconnected. Two contrasting examples of dense and sparse networks are a network of people in a family gathering in which almost everyone knows everyone else, and a network of people sitting on a public bus.
Is it a dense network?
edge_density(nx, loops = FALSE)
## [1] 0.0001700146
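For a directed network without self-loops, density is the number of observed edges divided by the number of possible edges, n × (n − 1). The sketch below checks this by hand on a hypothetical four-node graph and then repeats the arithmetic with the node and edge counts reported above for nx.

```r
library(igraph)

# Hypothetical toy graph: 4 nodes, 3 directed edges (1->2, 2->3, 3->4)
toy <- make_graph(c(1, 2, 2, 3, 3, 4), directed = TRUE)

n <- vcount(toy)
m <- ecount(toy)

# Observed edges divided by possible directed edges (no self-loops)
manual_density <- m / (n * (n - 1))
manual_density                     # 3 / 12 = 0.25
edge_density(toy, loops = FALSE)   # same value

# Same arithmetic with the counts reported above for nx
nx_density <- 44947 / (16260 * 16259)
nx_density
```

The second calculation reproduces the score printed by edge_density(nx, loops = FALSE), which confirms that the retweet network is very sparse.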
Think of centralization as a question of inequality and who is in control. In a centralized network, a small number of nodes (users) control the information flow. In a retweet network specifically, it means that only a handful of users retweet or are retweeted by others. Centralized and decentralized networks have different ramifications for the diffusion of ideas, norms, and effective mobilization.
By setting mode = c("in"), we calculate the centralization score based on the extent to which users are retweeted by others (as opposed to retweeting others).
So, is it centralized?
#Calculate centralization
centr_degree(nx, mode = c("in"), loops = TRUE,normalized = TRUE)$centralization
## [1] 0.1575273
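To get a feel for what the score means, it can help to compare two extreme toy networks. The sketch below is an illustration only: the star and ring graphs are hypothetical, and loops = FALSE is used here because these toy graphs contain no self-loops.

```r
library(igraph)

# Star: five users all point to a single hub (highly unequal indegree)
star <- make_star(6, mode = "in")

# Ring: every user receives exactly one edge (perfectly equal indegree)
ring <- make_ring(6, directed = TRUE)

star_c <- centr_degree(star, mode = "in", loops = FALSE)$centralization
ring_c <- centr_degree(ring, mode = "in", loops = FALSE)$centralization

star_c   # high: one node receives all incoming edges
ring_c   # 0: all nodes have the same indegree
```

A real network falls somewhere between these two extremes, so the score is easiest to interpret relative to such reference points.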
Have you heard the saying "birds of a feather flock together"? In a network, nodes tend to cluster together based on some shared attributes. For instance, Twitter users may retweet mostly content they agree with. Hence, this tendency will result in clusters of nodes based on similar mindsets or opinions. The extent to which a network reflects this pattern of clustering can be quantified using the modularity score. A higher score means a more divided network.
wtc <- cluster_walktrap(nx)
modularity(wtc)
## [1] 0.519009
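The following sketch shows why a divided network scores higher. It uses a hypothetical undirected toy graph made of two tightly knit groups of five joined by a single bridge, and compares the modularity of the "correct" two-group split with that of lumping everyone into one group.

```r
library(igraph)

# Two fully connected groups of 5 nodes, joined by one bridging edge (5-6)
toy <- disjoint_union(make_full_graph(5), make_full_graph(5))
toy <- add_edges(toy, c(5, 6))

# Modularity when nodes are split into the two groups
split_q <- modularity(toy, membership = c(rep(1, 5), rep(2, 5)))

# Modularity when every node is placed in the same group
single_q <- modularity(toy, membership = rep(1, 10))

split_q    # clearly positive: the split matches the structure
single_q   # 0: a single group explains nothing

# Walktrap should recover a similar division on its own
toy_wtc <- cluster_walktrap(toy)
modularity(toy_wtc)
```

With 21 edges in total, the two-group split works out to 2 × (10/21 − 0.25), roughly 0.45.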
Reciprocity is calculated as the proportion of reciprocated ties. In the retweet network, for example, reciprocity shows the extent to which pairs of users have mutually retweeted one another.
Which form of Twitter interactions (retweet vs. mention) is more reciprocal?
reciprocity(nx)
## [1] 0.007252987
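A hand-checkable example may clarify how the proportion is computed. In the hypothetical toy graph below, users 1 and 2 retweet each other, while user 1 also retweets user 3 without being retweeted back, so two of the three edges are reciprocated.

```r
library(igraph)

# Hypothetical toy graph with edges 1->2, 2->1, 1->3
toy <- make_graph(c(1, 2, 2, 1, 1, 3), directed = TRUE)

# Share of edges that have a counterpart in the opposite direction
toy_reciprocity <- reciprocity(toy)
toy_reciprocity   # 2 / 3
```

By comparison, the score of about 0.007 for nx indicates that mutual retweeting is rare in the #Ukraine network.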
I have previously introduced a range of indicators to quantify a network. Such indicators are only useful when comparing different networks. When analyzing a single network, we are more interested in node-level indicators.
A common task in network analysis is identifying influencers. An influencer could mean different things to different people. Here we try a couple of different metrics.
Indegree centrality measures the number of incoming connections a user has received. A high indegree in the retweet network means that the user is frequently retweeted by others. Do you agree that the most retweeted users are influencers? And why?
Outdegree centrality measures the number of outgoing connections a user has. A high outdegree in the retweet network means that the user frequently retweets other users. What would you call such users, mobilizers?
Betweenness centrality measures the number of times a node lies on the shortest path between other nodes. We use this metric to find users who act as ‘bridges’ between nodes in a network and who influence the information flow around a network.
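# Store each centrality score as a node attribute
V(nx)$indegree <- degree(nx, mode = "in")    # number of incoming edges
V(nx)$outdegree <- degree(nx, mode = "out")  # number of outgoing edges
V(nx)$bt <- betweenness(nx, directed = TRUE, weights = NA)  # weights = NA ignores edge weights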
In the code above, we calculate the aforementioned centrality measures and add them as node attributes, which will be included as columns in the nodelist. Below is a preview of the nodelist with centrality scores added.
library(DT) # provides datatable()
nodelist1 <- vertex_attr(nx)
nodelist1 <- as.data.frame(nodelist1)
datatable(nodelist1, options = list(pageLength = 10))
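To see what the three centrality scores mean, here is a sketch on a graph small enough to verify by hand. The four-node graph is hypothetical: users 1 and 4 both retweet user 2, and user 2 retweets user 3.

```r
library(igraph)

# Hypothetical toy graph with edges 1->2, 2->3, 4->2
toy <- make_graph(c(1, 2, 2, 3, 4, 2), directed = TRUE)

toy_in  <- degree(toy, mode = "in")    # 0 2 1 0: node 2 is retweeted twice
toy_out <- degree(toy, mode = "out")   # 1 1 0 1: node 3 retweets nobody

# Node 2 sits on both shortest paths that pass through a third node
# (1->2->3 and 4->2->3), so it is the only bridge.
toy_bt <- betweenness(toy, directed = TRUE, weights = NA)   # 0 2 0 0

data.frame(node = 1:4, indegree = toy_in, outdegree = toy_out, bt = toy_bt)
```

Node 2 ranks first on both indegree and betweenness here, but in larger networks the two rankings often diverge, which is one reason to look at several metrics.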
We use a community detection algorithm to cluster users into different groups (we call such groups clusters or communities). Users in the same cluster are more connected with one another than with users outside of the cluster. By using the community detection method, we can reveal important divisions and fragmentation that exist due to different opinions, values, and user characteristics.
Some community detection algorithms require intensive computation and may take a long time to produce an output.
k-core
Creating a k-core is fast and easy. We can use a k-core to identify a small subset of users who are the most interconnected. In a k-core, each node has at least k connections to other nodes in the core. Below we extract a 2-core (named twocore) in which each user has at least 2 edges to other users in the core.
kcore <- coreness(nx, mode="all")
twocore <- induced_subgraph(nx, kcore>=2)
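As a small illustration of what coreness() returns, consider a hypothetical toy graph: a triangle of three users plus a fourth user attached to the triangle by a single edge. The triangle members each keep two connections inside the core, while the fourth user drops out.

```r
library(igraph)

# Hypothetical toy graph: triangle 1-2-3 plus a pendant node 4 attached to 3
toy <- make_graph(c(1, 2, 2, 3, 3, 1, 3, 4), directed = FALSE)

toy_core <- coreness(toy, mode = "all")
toy_core   # 2 2 2 1

# Keep only the nodes whose coreness is at least 2
toy_twocore <- induced_subgraph(toy, which(toy_core >= 2))
vcount(toy_twocore)   # 3: the pendant node is removed
ecount(toy_twocore)   # 3: only the triangle edges remain
```

Note that node 3 has three connections in the full graph but still has coreness 2, because only two of them lead to other members of the core.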
cluster_walktrap
This is one of the community detection algorithms that are computationally intensive. Be patient while it crunches the numbers for you.
The code below creates an object called ceb. It contains the information about which cluster each node belongs to. Running membership(ceb)[1:10] afterwards shows the cluster ID of the first 10 nodes.
ceb <- cluster_walktrap(nx)
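Because walktrap can be slow on the full retweet network, it may be useful to try the workflow first on a small graph. The sketch below uses Zachary's karate club network, a 34-node example graph bundled with igraph, as a stand-in; the object name ceb_toy is chosen here to avoid overwriting ceb.

```r
library(igraph)

# Small built-in example network (34 nodes), used as a stand-in for nx
karate <- make_graph("Zachary")

ceb_toy <- cluster_walktrap(karate)

membership(ceb_toy)[1:10]   # cluster ID of the first 10 nodes
sizes(ceb_toy)              # number of nodes in each cluster
length(ceb_toy)             # number of clusters found
```

The same three calls work on ceb once the computation on nx has finished.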
Before you visualize a network, you need to decide how to color the nodes, how to size them, and which nodes to show.
In our example below, we color nodes based on the clusters they belong to. We set the node size based on betweenness centrality, with more central nodes drawn larger (PageRank, the famous scoring technique used by Google, is a common alternative). And we don't want to show all nodes, as that would create a messy network; instead, we show only the most interconnected subset (using k-core).
In the previous steps, we learned the code for calculating node- and network-level metrics (e.g., centrality). Here, we will pass the metrics to nodes and store them as node attributes. This allows the visualization code to pick up the attributes and use them for sizing and coloring.
In the code below, we add the betweenness centrality score (used for node size) and the cluster id (used for assigning color). We use V(nx) to access node attributes and E(nx) to access edge attributes. Since we visualize only the 3-core, we then create a subset of the network.
library(igraph)
library(visNetwork)
library(scales)
V(nx)$size <- betweenness(nx, directed = TRUE, weights = NA) / 1000 # set node size by betweenness centrality scores
wc <- cluster_walktrap(nx)
V(nx)$color <- membership(wc) # set color by subgroup id
kcore <- coreness(nx, mode="all")
threecore <- induced_subgraph(nx, kcore>=3)
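If you would rather size nodes by PageRank, a sketch along the following lines should work. It is an alternative to the betweenness-based sizing above, not the method used in this tutorial; the toy graph and the 10 to 40 size range are arbitrary choices for illustration.

```r
library(igraph)

# Hypothetical toy graph: users 1, 3 and 4 retweet user 2; user 2 retweets user 1
toy <- make_graph(c(1, 2, 3, 2, 4, 2, 2, 1), directed = TRUE)

# page_rank() returns a list; the scores are in $vector and sum to 1
pr <- page_rank(toy, directed = TRUE)$vector
pr

# Rescale the scores to a display-friendly size range (10 to 40, arbitrary)
V(toy)$size <- 10 + 30 * (pr - min(pr)) / (max(pr) - min(pr))
V(toy)$size
```

Rescaling to a fixed range avoids having to pick a divisor by trial and error, since raw centrality scores can differ by orders of magnitude between networks.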
Find a visualization algorithm that fits
And we visualize it. Notice that we set layout = "layout_nicely". This is how we specify which layout algorithm to use. There is a whole bunch of them: see the listing. If you are curious about the visual effects of different algorithms, try layout = "layout_in_circle", layout = "layout_with_kk", or layout = "layout_with_sugiyama".
visIgraph(threecore,idToLabel = TRUE,layout = "layout_nicely") %>%
visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)
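Under the hood, each layout name refers to an igraph function that assigns coordinates to every node. The sketch below calls two of them directly on a hypothetical eight-node ring to show what they return; it does not draw anything.

```r
library(igraph)

# Hypothetical toy graph: 8 nodes connected in a ring
toy <- make_ring(8)

# Each layout function returns a matrix with one row per node
circle_xy <- layout_in_circle(toy)   # nodes evenly spaced on a circle
kk_xy     <- layout_with_kk(toy)     # Kamada-Kawai force-directed layout

dim(circle_xy)   # 8 rows, 2 columns (x and y)
dim(kk_xy)       # 8 rows, 2 columns (x and y)
head(circle_xy, 3)
```

Passing the function name as a string, as in layout = "layout_in_circle", asks visIgraph to compute these coordinates for you before drawing the network.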